**********************************************************************
* TextInfo 1.3 by Erik Spåre (Parsec/Phuture 303) 990107 *
* Filematching routines by Anders Vedmar (Axehandle) *
**********************************************************************
IS THIS A PROGRAM FOR YOU?
TextInfo's task is to count all the various _unique_ words in texts
and list them. "Mine your mine" = 3 words, 2 unique (mine, your).
Read on. If you find the following facts interesting, then this may
be a program for you.
! Two Cities by Dickens has almost twice as many unique words
as the Koran, despite the fact that the Koran is bigger (Two
Cities 8041 unique/138384 total, Koran 4300/152164). Plato's
Republic (a translation, like the Koran) has 45 % more (6199/127005).
! Moby Dick has 13403/213486 words; that is 27 % more than
the Bible's 10560/812394, even though Moby Dick's size is only
about one quarter of the Bible's!
! A friend (greetings Théonore) told me that he had heard that
there was a word in the Bible that occurred 666 times, and it
was the name of the Beast. There is no such word in the King
James Bible... :(
! You only need to know the meaning of 48 words to understand
half of the words written in the Bible... (37 in the Koran,
44 in the Republic, 64 in Two Cities and 87 in Moby Dick).
It is indeed as Axehandle said: "One would have supposed that Allah
had a bigger vocabulary..."
TEXTINFO IS...
100 % Assembler (d'oh!)
Freeware
REQUIREMENTS
OS 2.04+ and a mind that is more interested in how many grains of
sand the average beach contains than in the latest sports results
or soap operas.
INTRODUCTION
"`Don't be afraid to hear me. Don't shrink from anything I say. I
am like one who died young. All my life might have been.'
`Is it not--forgive me; I have begun the question on my lips--a
pity to live no better life?'
`God knows it is a shame!'"
/ Charles Dickens, Two Cities.
"Here is wisdom. Let him that hath understanding count the number
of the beast: for it is the number of a man; and his number is Six
hundred threescore and six."
/ The Bible.
"O ye who believe! Approach not coding while ye are drunk, until
ye well know what ye type."
/ The Koran.
The idea for this program came to me one day when I was reading
the Koran. It was so amazingly boring that, naturally, my thoughts
were not occupied with the text (which some low-priority subroutine
of my brain dutifully supplied), but with the things I could do
after I had finished reading the current chapter. Sometimes I
discussed with myself how bored I was.
"Have I ever been this bored?"
"I don't know."
"How is it possible to constantly reiterate similar phrases, and
call the result wisdom?"
"He was probably drinking wine -- laudanum most likely -- and got
a bit excited."
And one day the idea for this program was born.
"I wonder how many different words this book contains."
"Not many, I'm sure."
"No... I wonder how many..."
"Maybe you could count them with a clever program?"
The program proved me right, needless to say; but it also made
me interested in other things about etexts, like how many words
are responsible for a certain percentage of the word total...
At first, the idea of releasing this program seemed a bit
absurd -- why would anyone use it? -- but now... I don't think
that Axehandle and I are the only ones who find things like this
interesting.
WHAT DOES THE PROGRAM DO?
It goes through textfiles, counts the bytes, letters, words, unique
words and more. Here is a sample of a list that was produced from
all the textfiles (not exceeding 5MB) on the Project Gutenberg CD
(Nov 94).
Bytes read       91502237      Total amount of bytes read
Letters          61101533      Total amount of letters
Words            13925117      Total amount of words
Unique              94816      Number of unique words
Subsumes            19681      Words with word-stem + ending
Syllabications       7210      Executed syllabications
Truncated chars       294      Chars exceeding max wordlength
Truncated words        27      Words that were too long
Examined file/s       204      Files that matched
the                791561      After the above info, the list
and                496698      of words follows. Unless
of                 440919      specified otherwise, all the
...                            unique words are listed.
zygote                  1
zyuganov                1
aarons(1)=aaron                Following the wordlist is the
abhorred(79)=abhorr            subsume report. The number in
abominations(246)=abomination  parentheses is how many times
...                            the word with ending occurred
zulus(1)=zulu                  before it was subsumed to its
zugs(1)=zug                    word-stem.
 5 %        1                  Finally there's the percentage
10 %        3                  list. The third line here
15 %        5                  means that you only need to
...                            know 5 words to understand
90 %     4045                  15 % of all the words that
95 %     8606                  were examined.
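For the curious, here is roughly what that bookkeeping amounts to,
sketched in Python (the real program is 100 % assembler, and the
filename below is made up):

  from collections import Counter
  import re

  def text_stats(data):
      # runs of letters are words; everything else separates them
      words = re.findall(r"[a-z]+", data.lower())
      counts = Counter(words)
      return {"Bytes read": len(data),
              "Letters": sum(len(w) for w in words),
              "Words": len(words),
              "Unique": len(counts)}, counts

  # latin-1 keeps one char per byte, so len(data) equals bytes read
  stats, counts = text_stats(open("alice.txt", encoding="latin-1").read())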
WHAT IS IT USEFUL FOR?
Apart from giving you swarms of interesting facts, it is not very
useful. It will tell you how big your vocabulary is (try comparing
the unique words found (with subsuming disabled) in your letters
written in your native language versus English).
If words like "fuck", "lame" or "cool" top your letters, then
maybe you should consider blushing...
If you are curious about a certain etext, check the first noun in
the list; it will in many cases tell you the whole plot. I have
tried this on five etexts...
NAME                   MOST COMMON NOUN
Alice in Wonderland    alice
Hacker's Crackdown     computer
Moby Dick              whale
The Bible              lord
The Koran              god
If you have lots of texts in a foreign language that you wish to
study, it could be a good idea to "start from the top"... Let's
say that you didn't understand a word of English, that your
favourite author was Lewis Carroll, and that you would love to read
"Alice in Wonderland" as it was originally written. Alice in
Wonderland is 150 kb, and 1083 words = 95 % of it; if the Gutenberg
results represent the English language (where 95 % = 8606 words)
you would "spare" yourself more than 7000 words with this method.
Hmm...
THE ARGUMENT LINE
All options are case sensitive.
Usage: TextInfo [-<OPTIONS>] <FILE/PATH> <DESTFILE> [ALL]
<FILE/PATH> is either a file or a file pattern.
<DESTFILE> is the name of the output file.
ALL is to be used when you want recursive matching.
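For example, to examine every file on a (hypothetical) Texts:
volume recursively, with syllabication disabled, writing the result
to RAM: (#? being the AmigaDOS wildcard):

  TextInfo -s "Texts:#?" RAM:result.txt ALL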
Options
-s Disable syllabication
This will disable syllabication. By default syllabication
is used, meaning that words that do not fit on one line,
and thus are separated by a hyphen, will be connected and
treated as one word. For instance "norr-[NEW LINE]sken"
will be listed as "norrsken" when syllabication is on;
otherwise it becomes 2 separate words. Same thing with
"norr-[NEW LINE]-sken" or even "norr-[NEW LINE] sken".
The only thing the syllabication routine requires is a
letter followed by a hyphen and a new line (CR and/or LF);
it will then connect the first found letter or letters.
There is one thing that will abort the syllabication, and
that is when another hyphen is found within the word that
is to be connected. E.g. "bread-[NEW LINE]and-butter" will
be treated as three words. This is not always good...
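Expressed in Python, the behaviour described above could be
modelled roughly like this (a sketch of the rule, not the actual
assembler routine):

  import re

  # letter, hyphen, newline (CR and/or LF), optionally one more
  # hyphen or a space, then the continuing letters -- but only if
  # no further hyphen follows (that aborts the join, as described)
  _HYPH = re.compile(r"([A-Za-z])-(?:\r\n|\r|\n)[- ]?([A-Za-z]+)\b(?!-)")

  def desyllabicate(text):
      # "norr-\nsken", "norr-\n-sken", "norr-\n sken" -> "norrsken";
      # "bread-\nand-butter" is left alone and becomes three words
      return _HYPH.sub(r"\1\2", text)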
-e Disable subsuming
Use this if you want subsuming to be disabled. Subsuming
is by default conducted on all the words that end on `s',
`ed' or `ing', if, and only if, the stem-word is found.
In "I have walked there, now I walk here" the word
`walked' ends on `ed', and since its word-stem `walk' also
is present, the word is subsumed to `walk'. But in "I
like stars" `stars' is not subsumed to `star' since the
word-stem is not present. The word-stem has to be at
least three chars, so `his' in "I said hi to... what's
his name..." won't be subsumed to `hi'. Subsuming is not
always correct; it would take a dictionary to make it
safe. For instance `cared' in "My boyfriend really cared
for me in his car" will incorrectly be subsumed to `car'...
That's why there's a subsume report.
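In Python terms, the default subsuming might look something like
this (a sketch of the described behaviour; the names are mine, not
the program's):

  from collections import Counter

  def subsume(counts, endings=("s", "ed", "ing")):
      report = []
      for word in sorted(counts):          # snapshot; safe to modify
          for end in endings:
              stem = word[:-len(end)]
              # fold the word into its stem only if the ending fits,
              # the stem is at least three chars, and the stem exists
              if word.endswith(end) and len(stem) >= 3 and stem in counts:
                  n = counts.pop(word)
                  counts[stem] += n
                  report.append(f"{word}({n})={stem}")
                  break
      return report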
-e<e1>,... Set endings to subsume
This works as the above option (in fact, it is the
same): it disables the default subsuming, but conducts
subsuming on words with the endings that you specify. For
instance "-eed,s,er" will perform subsuming on words
ending with ed, s or er. There are 210 bytes reserved for
the endings; after that comes the percentage text (in memory)...
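In terms of the sketch above, "-eed,s,er" would correspond to
calling subsume(counts, endings=("ed", "s", "er")).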
-r Disable subsume report
Use this option if you do not want a subsume report in the
output file.
-p Disable percentage list
This option will disable the percentage listing in the
output file. By default the number of words that make up
5, 10, 15... up to 95 percent of the total sum of words is
listed at the end. I haven't checked this routine very
much, so I am not 100 % sure that all the values are
correct.
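The idea, sketched in Python (my reconstruction of the described
output, not the actual routine):

  def percentage_list(counts, total):
      # how many of the most frequent words cover 5 %, 10 %... 95 %
      # of all words? total is the total number of words counted.
      freqs = sorted(counts.values(), reverse=True)
      covered, i, result = 0, 0, []
      for pct in range(5, 100, 5):
          while covered * 100 < pct * total:
              covered += freqs[i]
              i += 1
          result.append((pct, i))   # i words cover pct % of the text
      return result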
-n[<n>] Set number of words to list
Use this option to set the number of words to list in
the output file. This does not affect the subsume report.
Note: -n alone will disable the word list.
-l<n> Set lowest number to list.
The number n is required and tells the program to list
only words that occurred n times or more. If you write
"-l10000", only words that occurred ten thousand times
or more will be listed.
-m[<kb>] Set minimum filesize
By default this is set to 50 bytes; that means that
files smaller than 50 bytes will not be examined. If you
don't specify a number, all sizes will be valid; otherwise
the minimum size is set to 1000*<kb> bytes.
-M<kb> Set maximum filesize
As the -m option, but here you _must_ specify a kb
value. Files exceeding this size will not be loaded.
-t<t>,<s> Set number of tabs,size
By default the output words are displayed with a
width of 24 chars, or 3 tabs with the size of 8. The
maximum wordlength is derived from these numbers --
maxlen = tabs*tabsize-1 (3*8-1 = 23 chars with the
defaults). Number of tabs and tabsize may not exceed 9999.
-z Don't abort on zero
A textfile shouldn't contain the ASCII value zero (except
maybe as an EOF sign), so TextInfo will by default stop
the examination whenever a zero occurs. Use the -z option
to force the program to process all bytes in all files;
this is useful if you have word-processor files with long
headers (bound to hide a zero somewhere).
RUNNING TEXTINFO
When TextInfo is started, this is what will happen. First the
argument line is checked; if it is invalid the program will exit
with the short information text. If everything is ok the
destination file will now be opened (and immediately closed) just
to make sure this won't fail after an hour of intense counting.
Now all the files that match the given pattern are checked to
determine the maximum filesize; this amount of memory is then
allocated. If there is not enough memory, the program will exit
with an error message.
This is when the real program starts. The first matching file
will be loaded into the already allocated memory. (Initially I
allocated a memory block just as large as the current file,
processed it, and then freed it, but this turned out to make the
memory heavily fragmented, and in the end there was seldom enough
memory for the (often huge) final wordlist and output file).
The file will be pre-examined in two passes. In the first pass
the whole file will be lowercased and all non-letters will be set
to zero. If syllabication is wanted, this will be done here. In
the second pass _all_ words will be counted and at the same time
truncated if they exceed the maximum allowed wordlength. There is
no progress indicator for this, because it doesn't take much time.
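In Python terms the two passes amount to roughly this (a sketch;
the real code works in place on the raw bytes, and evidently also
treats national characters like å and ä as letters, which this
ASCII-only version does not):

  def pre_examine(data, maxlen=23):          # 23 = tabs*tabsize-1
      # pass 1: lowercase a-z, turn every non-letter into a zero byte
      table = bytes((c + 32) if 65 <= c <= 90 else
                    (c if 97 <= c <= 122 else 0)
                    for c in range(256))
      data = data.translate(table)
      # pass 2: collect the words, truncating over-long ones
      return [w[:maxlen] for w in data.split(b"\0") if w]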
Now the file is safe to examine, and the main routine is called.
It will check one word at a time; if the word has occurred before,
its counter will be incremented; if it is a new word, it will be
added and given a counter set to 1. Every 256th word, the progress
indicator will be updated.
When this is done, two numbers will be displayed: the first is
the amount of words the file contained; the second is how many of
them had not been found before. The program then loads the
next matching file.
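Reduced to Python, the main routine does something like this
(illustrative only):

  def count_file(words, counts):
      new = 0
      for n, word in enumerate(words, 1):
          if word in counts:
              counts[word] += 1
          else:
              counts[word] = 1      # a word not seen in any file yet
              new += 1
          if n % 256 == 0:
              pass                  # progress indicator updated here
      return len(words), new        # the two numbers displayed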
When all files have been examined, the memory block used for the
loaded files is freed. The subsuming is now executed, unless it is
disabled. After this the result is rearranged into a large
wordlist. This wordlist is then sorted; all the words that occurred
255 times or less are sorted instantly, the rest are bubble sorted.
If you have, say, 20 thousand words that appeared more than 255
times, this will take some time, but normally you will hardly
notice the sorting.
The words are sorted in order of frequency, with the most common
word (probably `the') at the top. Words that occurred an equal
number of times are sorted by their first 2 characters. Why only
the first two? Because it is just a side-effect of the counting
routine (a nice one for a change!). It does mean that the
listing is in ASCII order, so Swedish texts (for instance) will
unfortunately be listed with the ä-words before the å-ones. Ah
well!
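My guess at the idea, sketched in Python (the exact internals are
not documented here, so take this purely as an illustration of
`instant' bucket sorting plus bubble sorting for the heavy hitters):

  def sort_by_frequency(items):
      # items: (word, count) pairs, already roughly ordered by their
      # first two characters as a side effect of the counting routine
      buckets = [[] for _ in range(256)]
      heavy = []
      for pair in items:
          if pair[1] <= 255:
              buckets[pair[1]].append(pair)   # one pass, no compares
          else:
              heavy.append(pair)
      for i in range(len(heavy)):             # bubble sort, descending
          for j in range(len(heavy) - 1 - i):
              if heavy[j][1] < heavy[j + 1][1]:
                  heavy[j], heavy[j + 1] = heavy[j + 1], heavy[j]
      return heavy + [p for n in range(255, 0, -1) for p in buckets[n]]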
Finally the output file will be created and written to the
destination.
ABOUT THE CODE...
I do not wish to brag, so let me just say that the main routine for
counting the various words is amazingly fast. (Right Axehandle?)
However, the additional (boring) routines for making everything
foolproof, and the progress display, have slowed it down somewhat.
Still, if you disable syllabication and subsuming, counting and
sorting all the words in the Koran (0.8MB) and creating an output
file takes only 8 seconds on my A1200 (slightly more than half the
time Dopus needed just to count the lines). The Bible (66 files with
a total of about 4.6MB) takes 50 seconds. (My first version of
this program needed 45 minutes to go through the Koran! The second
version was a bit more efficient, and actually 29 THOUSAND percent
faster (before optimization)! A ratio that would make any
programmer drool...)
The subsume routine has not been optimized.
BUGS???
I have never encountered any bugs in this version. However, I have
only been able to check 90 MB of text at a time, so I cannot be
completely sure how it works on, say, 1GB of data. The number
displays can only handle 10 digits, or one unsigned longword.
THE PHUTURE
There will probably not be any updates to this program, unless I
get one single request from some unknown TextInfo user (that's all
the motivation I need!).
It has been suggested to me by Axehandle that there should be an
option to add results to an already existing wordlist. That way it
would be possible to create one huge wordlist formed from many,
many CD-ROMs.
XPK/LZX/LHA/ZIP support would also be very useful, since most
CDs pack their texts.
CONTACT ME...
If you want an update made, or something else, write...
EMail: blodskam@ebox.tninet.se (valid to end of July 1999,
after that use: blodskam@hotmail.com)
I have, btw, used the handle Parsec since the summer of 1991. I know
many people think handles are silly (even though /nicks are "kewl"),
but... It's a silly life! Take it seriously, and *you* are the fool!
HISTORY
v1.0 (960320)
** First public release
v1.1 (971217)
** According to a friend (Thomas Richter I believe) TextInfo
would crash if no end quote was found in the filepattern.
Fixed this.
v1.2 (980112)
** The progress indicator is now adapted to the shell width.
Thanks to Finn Nielsen for giving me a routine that demonstrated
how to do this.
I still haven't got a request for more features, although I
received an email from someone who had at least tried the program.
v1.3 (990107)
** When I wanted to include the output in an email, the mailer
of course couldn't handle the tabs. I tried to circumvent this
by specifying -t32,1, hoping that TextInfo would make the output
32 characters wide... but when it didn't work I vaguely remembered
being too lazy to accept more tabs than 9 (one digit only) and
so I fixed this and made sure that spaces are printed instead
of tabs, if the tabsize is set to 1.
This is all for this release, still no request for more features.
Perhaps the program is perfect now? :)